class: center, middle, inverse, title-slide

# Lecture 26

## Generalized Linear Models: Null model

### Psych 10 C

### University of California, Irvine

### 06/01/2022

---

## Generalized linear models

- Last class we introduced the topic of generalized linear models.

--

- The main motivation was to stop using the Normal distribution for our models because its assumptions (e.g. that the random variable can take any value between `\(-\infty\)` and `\(\infty\)`) were not met by our data.

--

- However, we still want to use a linear model because we know how to interpret its parameters `\((\beta_k)\)`.

--

- The first distribution that we used was the Binomial, which is a probability distribution for the sum of independent `\(0\)`, `\(1\)` outcomes obtained across `\(n\)` observations.

---

## Logistic regression

- In the logistic regression model, we assume that observations follow a Binomial distribution with probability `\(\theta\)` and `\(n\)` independent samples.

`$$y_i \sim \text{Binomial}(\theta,n)$$`

--

- This means that each of our `\(n\)` observations has a probability `\(\theta\)` of taking the value `\(1\)` and a probability `\(1-\theta\)` of taking the value `\(0\)`.

--

- If we add all of our observations, we get a single number representing the number of `\(1\)`'s observed in our sample. Therefore, this number can only take values between `\(0\)` and `\(n\)`.

--

- For now, we will assume that all participants are independent and identically distributed; however, we will compare our models using the individual 0, 1 responses rather than the summed counts.

---

## Covid test and vaccination status

- We will continue with our COVID test and vaccination status example.

--

- For this problem, we will assume that we have access to a perfect test: if a person tests positive, they definitely have COVID, and if they test negative, they don't.
--

- We take a sample of 40 participants from each of two populations: one where no one is vaccinated and one where everyone is vaccinated.

--

- We ask participants to take the COVID test, then we count the total number of COVID cases detected in each population and summarize those values in a table.

--

| Vaccination status | Sample size | Positive tests | Proportion positive |
|--------------------|:-----------:|:--------------:|:-------------------:|
| Not vaccinated     | 40          | 10             | 0.25                |
| Vaccinated         | 40          | 2              | 0.05                |

---

## Data visualization

- We can also use a bar graph to visualize our data:

--

.pull-left[

```r
covid %>%
  count(test_pcr, status) %>%
  group_by(status) %>%
  mutate(n = n / sum(n) * 100) %>%
  ggplot() +
  aes(x = status, y = n, fill = test_pcr,
      label = paste0(round(n, 2), "%")) +
  geom_col() +
  theme_classic() +
  xlab("Vaccination status") +
  ylab("Proportion of (+) cases") +
  theme(axis.title.x = element_text(size = 30),
        axis.title.y = element_text(size = 30),
        legend.text = element_text(size = 15),
        legend.title = element_text(size = 20)) +
  labs(fill = "Test result") +
  geom_text(position = position_stack(0.5))
```
]

.pull-right[
<br>
<img src="data:image/png;base64,#lec-26_files/figure-html/bar-out-1.png" style="display: block; margin: auto;" />
]

---

## Logistic regression

- Before applying the transformations that we talked about last week (Lec 25), let's introduce the two models that we will use and compare.

--

- As before, we have a "**Null**" model and a "**vaccination status**" model. Let us start with the Null model.

--

- The Null model can be formalized as follows:

`$$y_{i} \sim \text{Bernoulli}(\theta)$$`

- Where `\(y_i\)` represents the test result of the *i-th* participant (negative or positive), and `\(\theta\)` represents the probability that the *i-th* participant gets a positive result.
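--

- To build intuition, we can simulate data under such a model in R (a sketch, not the lecture data; the values `n = 40` and `theta = 0.15` are illustrative):

```r
set.seed(1)

# Each participant's result is an independent Bernoulli(theta) outcome:
# 1 = positive test, 0 = negative test
n     <- 40
theta <- 0.15
y_sim <- rbinom(n = n, size = 1, prob = theta)

# Summing the 0/1 outcomes gives a single Binomial count between 0 and n
sum(y_sim)
```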
--

- Given that we assume that our observations follow a Bernoulli distribution with probability `\(\theta\)`, the expected value for any observation `\(y_i\)` is `\(\mathbb{E}(y_i) = \theta\)`.

--

- Remember that in order to use a linear regression, we need to get rid of the bounds on `\(\theta\)`, and to do so, we work with the logarithm of the odds.

---

## Null model

- So, for the Null model:

`$$\ln\left(\frac{\theta}{1-\theta}\right) = \beta_0$$`

--

- Now that we have a linear model for the logarithm of the odds, we need to be able to go back to the probability `\(\theta\)` in order to produce the predictions made under the Null model.

--

- We can do this using algebra. We start by getting rid of the natural logarithm by exponentiating both sides of the equation:

`$$\exp\left\{\ln\left(\frac{\theta}{1-\theta}\right) \right\} = \exp\left(\beta_0\right)$$`

--

`$$\frac{\theta}{1-\theta} = \exp\left(\beta_0\right)$$`

---

## Null model

- Now, we solve the equation for `\(\theta\)`:

`$$\theta = (1-\theta)e^{\beta_0}$$`

--

`$$\theta = e^{\beta_0}-\theta e^{\beta_0}$$`

--

`$$\theta + \theta e^{\beta_0} = e^{\beta_0}$$`

--

`$$\theta \left(1+ e^{\beta_0}\right) = e^{\beta_0}$$`

--

`$$\theta = \frac{e^{\beta_0}}{1+ e^{\beta_0}}$$`

--

- Thus, according to the Null model, the probability that any participant will test positive is equal to `\(\frac{e^{\beta_0}}{1+ e^{\beta_0}}\)`.

--

- As in previous cases, the Null model assumes that vaccination status does not have an effect on the probability of testing positive on our "perfect" COVID test.

---

## Null model

- The parameter `\(\beta_0\)` in the Null model is not as easy to interpret as in other linear models. However, we can interpret a transformation of this parameter.
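--

- The relationship we just derived, `\(\theta = \frac{e^{\beta_0}}{1+ e^{\beta_0}}\)`, can be checked numerically in R: the built-in **`plogis()`** computes exactly this transformation, and **`qlogis()`** computes the log-odds (a sketch; the value of `b0` is arbitrary):

```r
b0 <- -1.5  # any value works; -1.5 is arbitrary

# Inverse of the log-odds: theta = exp(b0) / (1 + exp(b0))
theta <- exp(b0) / (1 + exp(b0))

# plogis() implements this same transformation...
all.equal(theta, plogis(b0))  # TRUE

# ...and qlogis() maps theta back to the log-odds
all.equal(b0, qlogis(theta))  # TRUE
```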
--

- From the previous equations we know that:

`$$\frac{\theta}{1-\theta} = \exp\left(\beta_0\right)$$`

--

- This means that we can interpret the value of `\(e^{\hat{\beta}_0}\)` as *how much more probable it is to get a positive result for COVID than it is to get a negative result*.

---

## Interpretation

- For example:

--

- If `\(0<e^{\hat{\beta}_0}<1\)`, then we know that the probability of testing positive for COVID regardless of vaccination status (because the Null assumes that vaccination status makes no difference) is less than the probability of getting a negative result.

--

- If `\(e^{\hat{\beta}_0}=1\)`, then the probability of testing positive is equal to the probability of testing negative.

--

- If `\(1<e^{\hat{\beta}_0}\)`, then the probability of testing positive is greater than the probability of testing negative.

--

- In comparison to other linear models, the value of `\(\beta\)` in a generalized linear model is not as easy to find, so we will always need to use a computer.

---

## Estimating `\(\beta_0\)`

- To estimate the parameter `\(\beta_0\)` in the logistic regression model we need to use a new function in R: the function **`glm()`**.

--

- Similar to the **`lm()`** function we saw before, the **`glm()`** function takes 3 arguments. First, just as before, we need to specify the **`formula`** in the form `\(y \sim x\)` and the **`data`**.

--

- However, this time we also need to specify the **`family`** of the distribution of our data, which in this case corresponds to the Binomial.
--

- For example, we can get our estimate of `\(\beta_0\)` using the following code:

```r
betas_null <- glm(formula = test ~ 1,
                  data = covid,
                  family = binomial)$coef
```

.pull-left[
`\(\hat{\beta}_0 =\)` -1.73

`\(e^{\hat{\beta}_0} =\)` 0.18
]

.pull-right[
`\(\hat{\theta} = \frac{e^{\hat{\beta}_0}}{1+e^{\hat{\beta}_0}} =\)` 0.15
]

---

## Visualizing the predictions of the Null

- We can add the predictions of the Null model to our bar graph to compare our results against these predictions visually.

<img src="data:image/png;base64,#lec-26_files/figure-html/bar2-out-1.png" style="display: block; margin: auto;" />

---

## How good is the model?

- One disadvantage that we face now is that we no longer have a simple way to evaluate the adequacy of our models.

--

- Before, we used to be able to compute the error. But the notion of error does not make much sense in this new setting because now our observations can only take one of two possible values.

--

- In order to perform model comparisons, we need a new way to measure the *fit* of each model.

--

- The question that we should be interested in is:
  - If I have a value of `\(\hat{\theta}\)` that represents the prediction my model (in this case the Null model) would make, how likely is it that we would observe the data that we actually observed?

---

## Model predictions and model fit

- Remember that in order to compare any set of models, we need at least two pieces of information.

--

- First is the **model fit** ("how good is this model?"). Second is the **model complexity** ("how many parameters do I need to estimate under this model to make a prediction?").

--

- As previously described, the model fit in generalized linear models is captured by a number that indicates how likely our observations are (i.e. the total number of positive and negative tests registered) according to our model.
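--

- For instance, with `\(\hat{\theta} = 0.15\)` the Null model expects `\(0.15 \times 40 = 6\)` positives in each group of 40, while we observed 10 and 2 (a quick sketch of this comparison in R):

```r
theta_hat  <- 0.15  # prediction of the Null model
group_size <- 40    # participants per vaccination group

# Expected number of positives per group under the Null model
expected_positives <- theta_hat * group_size  # 6

# Observed positives (from the summary table)
observed_positives <- c(not_vaccinated = 10, vaccinated = 2)
```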
--

- In some sense, we start by assuming that our model is correct, and then we evaluate how likely our data is according to its predictions (as defined by the estimated value of `\(\theta\)`).

--

- In the case of the Null model, we can formalize its predictions as: `\(\hat{\theta} = \frac{e^{\hat{\beta}_0}}{1+e^{\hat{\beta}_0}} =0.15\)`

---

## Predictions and fit

- First, we add the predictions of the Null model to the data:

```r
covid <- covid %>%
  mutate(prediction_null = exp(betas_null) / (1 + exp(betas_null)))
```

--

- With the predictions added, now we can calculate the model fit. For this, we need to use the density of the Binomial distribution. In R we can calculate this density with the function **`dbinom()`**.

--

- The function **`dbinom()`** takes 3 arguments: **`x`**, which represents the value of the observation. In our case, `x` corresponds to whether the test came back negative or positive.

--

- The second argument is the **`size`**, which corresponds to the number of independent observations included. In this case, given that we are considering a single observation (without adding them), the size is equal to 1.

--

- The third argument that the function takes is **`prob`**, the probability that the variable takes the value 1. In our case, this probability is equal to the prediction of the model, `\(\hat{\theta}\)`.

---

## Log-likelihood

- Remember that when we have independent observations, the probability of a sequence of `\(0\)`'s and `\(1\)`'s is equal to the product of the individual probabilities.

--

- Because the product of many small probabilities quickly becomes too small for a computer to represent accurately, we work with the natural logarithm instead. The advantage here is that the natural logarithm transforms multiplications into sums!

--

- We can get the logarithm of the density of the binomial distribution by adding **`log = TRUE`** to the **`dbinom()`** function in R.
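--

- As a quick check (a sketch using `\(\hat{\theta} = 0.15\)` from our Null model), the density of a single observation is `\(\hat{\theta}\)` for a positive test and `\(1-\hat{\theta}\)` for a negative test:

```r
theta_hat <- 0.15

# Density of a single 0/1 observation
dbinom(x = 1, size = 1, prob = theta_hat)  # 0.15
dbinom(x = 0, size = 1, prob = theta_hat)  # 0.85

# With log = TRUE we get the natural logarithm of those values
dbinom(x = 1, size = 1, prob = theta_hat, log = TRUE)  # log(0.15)
```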
--

- The values that we get are known as the log-likelihood of each observation and are central to a large branch of statistics.

--

- We can add the log-likelihood of each individual observation to our data using the following code:

--

```r
covid <- covid %>%
  mutate(loglikelihood_null = dbinom(x = test,
                                     size = 1,
                                     prob = prediction_null,
                                     log = TRUE))
```

---

## BIC

- Using the log-likelihood values that we just calculated, we can now compute the BIC for this model.

--

- Remember that the BIC has two components:

`$$BIC = \text{fit} + \text{complexity}$$`

--

- The complexity part has not changed, and so it will be equal to:

`$$\text{complexity} = k \ln(N)$$`

- Where `\(k\)` represents the number of parameters in the model and `\(N\)` represents the total sample size in the experiment (how many observations we have).

--

- However, the model fit is now defined as:

`$$\text{fit} = - 2 \sum_{i=1}^{N} ll(y_i)$$`

- where `\(ll(y_i)\)` represents the log-likelihood associated with the *i-th* observation in the experiment.

---

## BIC

- For the binomial model, the value of the sum of the log-likelihoods `\(\sum_{i=1}^{N} ll(y_i)\)` will always be between `\(-\infty\)` and `\(0\)`.

--

- The closer it is to `\(0\)`, the better the model is (more negative values in a GLM indicate worse model predictions).

--

- Because we multiply that sum by `\(-2\)`, larger values of *fit* correspond to worse models, and so we should prefer models that have smaller BIC values.

--

- For our Null model then we can compute the BIC as:

```r
# Total sample size
n_total <- nrow(covid)

# Complexity: the number of parameters of the Null model is equal to 1
complexity <- 1 * log(n_total)

# Model fit: we add all the values of the log-likelihood
fit <- -2 * sum(covid$loglikelihood_null)

# The BIC is the sum of model fit and model complexity
bic_null <- fit + complexity
```

---

## BIC

- The BIC associated with the Null model is equal to 72.02.
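--

- As a sanity check (a sketch, assuming the same `covid` data), R's built-in **`logLik()`** and **`BIC()`** functions apply this same formula and should give the same number:

```r
# Fit the Null model again and let R compute the BIC directly
null_model <- glm(test ~ 1, data = covid, family = binomial)

# BIC() computes -2 * logLik + k * log(N), matching our calculation
BIC(null_model)  # 72.02
```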
--

- As with other problems, we have no use for the BIC in isolation.

--

- Remember that we only use the BIC to establish comparisons between our models and not to make any statements about the data or any of the models themselves.

--

- This means that we first need to do the same computations for a second model, the one that assumes that the probability of testing positive depends on the group that you belong to.

--

- Once we get the BIC value for that second model, we will be able to make the comparisons that we need in order to answer our scientific question.